Code
library(tidyverse)
library(here)
library(DT)
library(gt)
library(kableExtra)
# Imports data
names <- read_csv(here("StateNames_A.csv"))
# Interactive Data Table
DT::datatable(names)Our dataset(s) in this lab concerns baby names and their popularity over time. At this link, you can find the names for ALL 50 states, in separate datasets organized by first letter. For each year, and for each name with at least 50 recorded babies born, we are given the counts of how many babies had that name.
Let’s take a look at how the name “Allison” has changed over time. As my name begins with “A”, you should download the StateNames_A.csv dataset from the link above.
library(tidyverse)
library(here)
library(DT)
library(gt)
library(kableExtra)
# Imports data
names <- read_csv(here("StateNames_A.csv"))
# Interactive Data Table
DT::datatable(names)each state should be its own row
and each sex should have its own column
if there were no babies born for that combination of state & sex there should be a 0 (not an NA)
The dataset has a column titled Gender, which contains two values "F" and "M", representing “Female” and “Male”. The sex someone was assigned at birth is different from their gender identity (definitions). Thus, this variable should be renamed to Sex or Sex at Birth.
allison_f which contains only the babies assigned Female at birth.This should be the data set you use for the remainder of the assignment.
Make a visualization showing how the popularity of the name “Allison” has changed over the years. To be clear, each year should have one observation–the total number of Allisons born that year.
allison <- names |>
rename(Sex = Gender) |>
filter(Name == "Allison") |>
group_by(State, Sex) |>
summarize(Total = sum(Count),
.groups = "drop") |>
pivot_wider(names_from = Sex,
values_from = Total,
values_fill = 0) |>
print()# A tibble: 51 × 3
State F M
<chr> <dbl> <dbl>
1 AK 232 0
2 AL 1535 0
3 AR 1198 0
4 AZ 1880 0
5 CA 12413 0
6 CO 1594 0
7 CT 1099 0
8 DC 321 0
9 DE 294 0
10 FL 4455 0
# … with 41 more rows
allison |>
knitr::kable(format = "html",
col.names = c("State",
"Female",
"Male"),
caption = "Frequency of Allisons in Each State") |>
kableExtra::kable_classic(html_font = "Arial") |>
kableExtra::kable_styling(bootstrap_options = "striped")| State | Female | Male |
|---|---|---|
| AK | 232 | 0 |
| AL | 1535 | 0 |
| AR | 1198 | 0 |
| AZ | 1880 | 0 |
| CA | 12413 | 0 |
| CO | 1594 | 0 |
| CT | 1099 | 0 |
| DC | 321 | 0 |
| DE | 294 | 0 |
| FL | 4455 | 0 |
| GA | 3257 | 0 |
| HI | 183 | 0 |
| IA | 1477 | 0 |
| ID | 451 | 0 |
| IL | 5110 | 0 |
| IN | 3067 | 0 |
| KS | 1283 | 0 |
| KY | 1905 | 20 |
| LA | 1209 | 0 |
| MA | 2218 | 0 |
| MD | 2229 | 0 |
| ME | 340 | 0 |
| MI | 4014 | 0 |
| MN | 2374 | 0 |
| MO | 2882 | 0 |
| MS | 817 | 0 |
| MT | 226 | 0 |
| NC | 3435 | 0 |
| ND | 285 | 0 |
| NE | 807 | 0 |
| NH | 412 | 0 |
| NJ | 3052 | 0 |
| NM | 399 | 0 |
| NV | 729 | 0 |
| NY | 5747 | 0 |
| OH | 5487 | 0 |
| OK | 1421 | 0 |
| OR | 1186 | 0 |
| PA | 4307 | 0 |
| RI | 306 | 0 |
| SC | 1228 | 0 |
| SD | 376 | 0 |
| TN | 2488 | 0 |
| TX | 10192 | 0 |
| UT | 1125 | 0 |
| VA | 3220 | 0 |
| VT | 135 | 0 |
| WA | 1956 | 0 |
| WI | 2367 | 0 |
| WV | 813 | 0 |
| WY | 142 | 0 |
allison_f <- names |>
rename(Sex = Gender) |>
filter(Name == "Allison")
allison_f|>
group_by(Year) |>
summarize(Total = sum(Count)) |>
ggplot(mapping = aes(x = Year,
y = Total)) +
geom_line() +
labs(y = NULL,
title = "Total Number of Allisons Born Between 1997-2014") +
scale_x_continuous(breaks = seq(1997, 2014, by = 1),
guide = guide_axis(n.dodge = 2)) +
scale_y_continuous(breaks = seq(5000, 9000, by = 1000),
limits = c(5000, 8000)) +
theme(plot.title = element_text(face = "bold",
hjust = 0.5))allison_f# A tibble: 914 × 5
Name Year Sex State Count
<chr> <dbl> <chr> <chr> <dbl>
1 Allison 1997 F AK 19
2 Allison 1998 F AK 10
3 Allison 1999 F AK 13
4 Allison 2000 F AK 14
5 Allison 2001 F AK 9
6 Allison 2002 F AK 14
7 Allison 2003 F AK 14
8 Allison 2004 F AK 13
9 Allison 2005 F AK 10
10 Allison 2006 F AK 18
# … with 904 more rows
Source:
https://stackoverflow.com/questions/55948425/why-is-scale-y-continuous-not-working-here-no-error-message-given
https://www.datanovia.com/en/blog/ggplot-title-subtitle-and-caption/#:~:text=Change%20the%20font%20appearance%20%28text%20size%2C%20color%20and,font%20face%20include%3A%20%E2%80%9Cplain%E2%80%9D%2C%20%E2%80%9Citalic%E2%80%9D%2C%20%E2%80%9Cbold%E2%80%9D%20and%20%E2%80%9Cbold.italic%E2%80%9D.
Fit a linear model with the year as the explanatory variable, and the number of Allisons as the response. Similar to #3, each year should have one observation–the total number of Allisons born that year.
Write out the estimated regression equation.
Estimated Total Number of Allisons = 209689.8 -101.5(Year)
***Note: Year must be substituted with the actual Year (i.e. 1997, 2014, etc.)
Plot the residuals of the model, that is, the actual values minus the predicted values. Comment on the residuals - do you see any patterns?
There is slight fanning in the residuals plot with the residuals forming a concave up parabola.
The augment() function from the broom package may be helpful!
We can not conclude that the name “Allison” is decreasing in popularity over the years 1997 to 2014 for babies assigned “female” at birth from this model. This is due to the slight curvature in the residuals plot that suggests that we should not use a straight line to model this data.
allison_lm <- allison_f |>
group_by(Year) |>
summarize(Total = sum(Count)) |>
lm(Total ~ Year, data = _) |>
print()
Call:
lm(formula = Total ~ Year, data = summarize(group_by(allison_f,
Year), Total = sum(Count)))
Coefficients:
(Intercept) Year
209815.1 -101.6
allison_lm |>
broom::augment() |>
ggplot(mapping = aes(y = .resid, x = .fitted)) +
geom_point() +
labs(x = "Fitted",
y = NULL,
title = "Residual Plot") +
scale_x_continuous(breaks = seq(5000, 7000, by = 500),
limits = c(5000, 7000)) +
scale_y_continuous(breaks = seq(-1000, 1000, by = 500),
limits = c(-1250, 1250)) +
theme(plot.title = element_text(face = "bold",
hjust = 0.5))In middle school I was so upset with my parents for not naming me “Allyson”. Past my pre-teen rebellion, I’m happy with my name and am impressed when baristas spell it “Allison” instead of “Alison”. But I don’t have it as bad as my good friend Allan!
It looks like you want to filter for a vector of values. What tools have you learned which can help you accomplish this task?
each spelling should be its own column
each state should have its own row
a 0 (not an NA) should be used to represent locations where there were no instances of these names
In the year 2000, 21.364% of all Male birthed babies were named some form of “Allan”, “Alan”, or “Allen” were born in California and 2.869% were born in Pennsylvania. Looking closer, in the California selection: 19.86% were named Allen, 65.34% were named Alan, and 14.785% were named Allan. In the Pennsylvania selection: 47.05% were named Allen, 42.85% were named Alan, and 1.01% were named Allan.
allans <- tibble(Name = c("Allan",
"Alan",
"Allen"))
allan_names <- names |>
filter(Gender == "M") |>
semi_join(allans, by = "Name") |>
rename("Sex" = "Gender")
allan_names |>
group_by(Name, Year) |>
summarize(Total = sum(Count)) |>
ggplot(mapping = aes(x = Year,
y = Total,
color = Name)) +
geom_line() +
labs(y = NULL,
title = "Total Number of Allans Born Between 1997-2014") +
scale_x_continuous(breaks = seq(1997, 2014, by = 1),
guide = guide_axis(n.dodge = 2)) +
scale_y_continuous(breaks = seq(0, 3500, by = 1000),
limits = c(0, 3500)) +
theme(plot.title = element_text(face = "bold",
hjust = 0.5)) allan <- allan_names |>
pivot_wider(names_from = "Name",
values_from = "Count",
values_fill = 0) |>
filter(Year == 2000) |>
mutate(Total = Allan + Allen + Alan,
across(.cols = Allen:Allan, .fns = ~.x/Total,
.names = "{col}Prop"),
OverallProp = Total/sum(Total)) |>
select(Year, Sex, State, Allen, AllenProp, Alan, AlanProp, Allan,
AllanProp, Total, OverallProp) |>
print()# A tibble: 43 × 11
Year Sex State Allen AllenProp Alan AlanProp Allan AllanP…¹ Total Overa…²
<dbl> <chr> <chr> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
1 2000 M AL 27 0.529 24 0.471 0 0 51 0.0123
2 2000 M AR 12 0.4 18 0.6 0 0 30 0.00723
3 2000 M AZ 22 0.186 85 0.720 11 0.0932 118 0.0285
4 2000 M CA 176 0.199 579 0.653 131 0.148 886 0.214
5 2000 M CO 14 0.2 51 0.729 5 0.0714 70 0.0169
6 2000 M CT 12 0.462 14 0.538 0 0 26 0.00627
7 2000 M DC 0 0 6 1 0 0 6 0.00145
8 2000 M FL 78 0.375 97 0.466 33 0.159 208 0.0502
9 2000 M GA 42 0.333 79 0.627 5 0.0397 126 0.0304
10 2000 M HI 7 0.467 8 0.533 0 0 15 0.00362
# … with 33 more rows, and abbreviated variable names ¹AllanProp, ²OverallProp
allan |>
knitr::kable(format = "html",
col.names = c("Year",
"Sex",
"State",
"Allen",
"Allen Proportion in State",
"Alan",
"Alan Proportion in State",
"Allan",
"Allan Proportion in State",
"Total",
"Overall State Proportion in Year"),
caption = "Summary of Frequency and Proportion of 'Allen', 'Allan, and 'Allan' in Each State") |>
kableExtra::kable_classic(html_font = "Arial") |>
kableExtra::kable_styling(bootstrap_options = "striped")| Year | Sex | State | Allen | Allen Proportion in State | Alan | Alan Proportion in State | Allan | Allan Proportion in State | Total | Overall State Proportion in Year |
|---|---|---|---|---|---|---|---|---|---|---|
| 2000 | M | AL | 27 | 0.5294118 | 24 | 0.4705882 | 0 | 0.0000000 | 51 | 0.0122980 |
| 2000 | M | AR | 12 | 0.4000000 | 18 | 0.6000000 | 0 | 0.0000000 | 30 | 0.0072341 |
| 2000 | M | AZ | 22 | 0.1864407 | 85 | 0.7203390 | 11 | 0.0932203 | 118 | 0.0284543 |
| 2000 | M | CA | 176 | 0.1986456 | 579 | 0.6534989 | 131 | 0.1478555 | 886 | 0.2136484 |
| 2000 | M | CO | 14 | 0.2000000 | 51 | 0.7285714 | 5 | 0.0714286 | 70 | 0.0168797 |
| 2000 | M | CT | 12 | 0.4615385 | 14 | 0.5384615 | 0 | 0.0000000 | 26 | 0.0062696 |
| 2000 | M | DC | 0 | 0.0000000 | 6 | 1.0000000 | 0 | 0.0000000 | 6 | 0.0014468 |
| 2000 | M | FL | 78 | 0.3750000 | 97 | 0.4663462 | 33 | 0.1586538 | 208 | 0.0501567 |
| 2000 | M | GA | 42 | 0.3333333 | 79 | 0.6269841 | 5 | 0.0396825 | 126 | 0.0303834 |
| 2000 | M | HI | 7 | 0.4666667 | 8 | 0.5333333 | 0 | 0.0000000 | 15 | 0.0036171 |
| 2000 | M | IA | 9 | 0.4285714 | 12 | 0.5714286 | 0 | 0.0000000 | 21 | 0.0050639 |
| 2000 | M | ID | 6 | 0.4285714 | 8 | 0.5714286 | 0 | 0.0000000 | 14 | 0.0033759 |
| 2000 | M | IL | 60 | 0.3191489 | 114 | 0.6063830 | 14 | 0.0744681 | 188 | 0.0453340 |
| 2000 | M | IN | 25 | 0.3289474 | 38 | 0.5000000 | 13 | 0.1710526 | 76 | 0.0183265 |
| 2000 | M | KS | 10 | 0.4166667 | 14 | 0.5833333 | 0 | 0.0000000 | 24 | 0.0057873 |
| 2000 | M | KY | 15 | 0.5357143 | 13 | 0.4642857 | 0 | 0.0000000 | 28 | 0.0067519 |
| 2000 | M | LA | 30 | 0.5660377 | 18 | 0.3396226 | 5 | 0.0943396 | 53 | 0.0127803 |
| 2000 | M | MA | 16 | 0.2711864 | 35 | 0.5932203 | 8 | 0.1355932 | 59 | 0.0142272 |
| 2000 | M | MD | 15 | 0.3947368 | 23 | 0.6052632 | 0 | 0.0000000 | 38 | 0.0091633 |
| 2000 | M | ME | 0 | 0.0000000 | 7 | 1.0000000 | 0 | 0.0000000 | 7 | 0.0016880 |
| 2000 | M | MI | 64 | 0.4740741 | 55 | 0.4074074 | 16 | 0.1185185 | 135 | 0.0325537 |
| 2000 | M | MN | 28 | 0.5283019 | 25 | 0.4716981 | 0 | 0.0000000 | 53 | 0.0127803 |
| 2000 | M | MO | 26 | 0.4814815 | 20 | 0.3703704 | 8 | 0.1481481 | 54 | 0.0130215 |
| 2000 | M | MS | 19 | 0.7037037 | 8 | 0.2962963 | 0 | 0.0000000 | 27 | 0.0065107 |
| 2000 | M | NC | 56 | 0.3943662 | 70 | 0.4929577 | 16 | 0.1126761 | 142 | 0.0342416 |
| 2000 | M | NE | 0 | 0.0000000 | 7 | 1.0000000 | 0 | 0.0000000 | 7 | 0.0016880 |
| 2000 | M | NH | 0 | 0.0000000 | 5 | 1.0000000 | 0 | 0.0000000 | 5 | 0.0012057 |
| 2000 | M | NJ | 28 | 0.2718447 | 66 | 0.6407767 | 9 | 0.0873786 | 103 | 0.0248372 |
| 2000 | M | NM | 9 | 0.3103448 | 20 | 0.6896552 | 0 | 0.0000000 | 29 | 0.0069930 |
| 2000 | M | NV | 10 | 0.2127660 | 29 | 0.6170213 | 8 | 0.1702128 | 47 | 0.0113335 |
| 2000 | M | NY | 80 | 0.3149606 | 131 | 0.5157480 | 43 | 0.1692913 | 254 | 0.0612491 |
| 2000 | M | OH | 79 | 0.5163399 | 64 | 0.4183007 | 10 | 0.0653595 | 153 | 0.0368941 |
| 2000 | M | OK | 16 | 0.4705882 | 18 | 0.5294118 | 0 | 0.0000000 | 34 | 0.0081987 |
| 2000 | M | OR | 14 | 0.4117647 | 20 | 0.5882353 | 0 | 0.0000000 | 34 | 0.0081987 |
| 2000 | M | PA | 56 | 0.4705882 | 51 | 0.4285714 | 12 | 0.1008403 | 119 | 0.0286954 |
| 2000 | M | RI | 0 | 0.0000000 | 6 | 1.0000000 | 0 | 0.0000000 | 6 | 0.0014468 |
| 2000 | M | SC | 22 | 0.5945946 | 15 | 0.4054054 | 0 | 0.0000000 | 37 | 0.0089221 |
| 2000 | M | TN | 37 | 0.4933333 | 33 | 0.4400000 | 5 | 0.0666667 | 75 | 0.0180854 |
| 2000 | M | TX | 105 | 0.1962617 | 381 | 0.7121495 | 49 | 0.0915888 | 535 | 0.1290089 |
| 2000 | M | UT | 12 | 0.4444444 | 15 | 0.5555556 | 0 | 0.0000000 | 27 | 0.0065107 |
| 2000 | M | VA | 36 | 0.4000000 | 42 | 0.4666667 | 12 | 0.1333333 | 90 | 0.0217024 |
| 2000 | M | WA | 27 | 0.3375000 | 43 | 0.5375000 | 10 | 0.1250000 | 80 | 0.0192911 |
| 2000 | M | WI | 19 | 0.3333333 | 26 | 0.4561404 | 12 | 0.2105263 | 57 | 0.0137449 |
allan_ca_pa <- allan |>
filter(State %in% c("CA", "PA")) |>
print()# A tibble: 2 × 11
Year Sex State Allen AllenProp Alan AlanProp Allan AllanProp Total Overa…¹
<dbl> <chr> <chr> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
1 2000 M CA 176 0.199 579 0.653 131 0.148 886 0.214
2 2000 M PA 56 0.471 51 0.429 12 0.101 119 0.0287
# … with abbreviated variable name ¹OverallProp
allan_ca_pa |>
knitr::kable(format = "html",
col.names = c("Year",
"Sex",
"State",
"Allen",
"Allen Proportion in State",
"Alan",
"Alan Proportion in State",
"Allan",
"Allan Proportion in State",
"Total",
"Overall State Proportion in Year"),
caption = "Summary of Frequency and Proportion of 'Allen', 'Allan, and 'Allan' in California and Pennsylvania",
digits = 3) |>
kableExtra::kable_classic(html_font = "Cambria") |>
kableExtra::kable_styling(bootstrap_options = "striped",
position = "left")| Year | Sex | State | Allen | Allen Proportion in State | Alan | Alan Proportion in State | Allan | Allan Proportion in State | Total | Overall State Proportion in Year |
|---|---|---|---|---|---|---|---|---|---|---|
| 2000 | M | CA | 176 | 0.199 | 579 | 0.653 | 131 | 0.148 | 886 | 0.214 |
| 2000 | M | PA | 56 | 0.471 | 51 | 0.429 | 12 | 0.101 | 119 | 0.029 |